Process:
1st) Simulate a population dataset based on the questions in the recruitment form. This represents a rough guess at the total population of those living at 30-60% AMI in Boulder City. This fake dataset is based on initial estimates and/or guesses about the demographic parameters (including what the parameters should be).
2nd) Randomly sample 4000 applicants from the simulated population data.
3rd) Select first ‘wave’ of 200 program selections using weighted samples.
4th) Select second and third waves using propensity score matching against the applicant pool.
5th) Make a list of additional backups to use if further verifications are needed. Define a process for choosing these backups that prioritizes the least represented groups.
Last) Make the dataset with selections and backups available for download.
There will be enough recruits into the program that we can have multiple waves of selections within the weighting criteria we define.
Failures of verification will be ~randomly distributed across groups.
For the sake of the simulations and calculations here (which are just for an abstract presentation of the process), assume there will be 4000 applicants, 200 selections, and 200 backups.
For the purposes of weighting, assume groups are independent. That is, we have estimates for the proportion of the population by racial category and we use these weights to make a random selection, likewise with gender, and disability, etc.
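As a small worked example of the independence assumption, the share of any combined profile is the product of its marginal shares. The sketch below uses values from the `props` column of the probability table later in this document; the choice of profile and the variable names are illustrative only.

```python
from math import prod

# Marginal shares for one example profile, copied from the `props` column
# of the probability table in this document.
profile = {
    "race_ethnicity": ("Hispanic", 0.100),
    "gender": ("Woman", 0.398),
    "child_household": ("Yes", 0.400),
    "disability": ("None", 0.850),
}

# Under independence, the joint share is the product of the marginal shares.
joint_share = prod(share for _, share in profile.values())

# Expected head count for this profile in a simulated population of 25,000.
expected_count = joint_share * 25_000

print(f"joint share: {joint_share:.6f}")
print(f"expected count: {expected_count:.1f}")
```

Real populations are unlikely to be independent across these characteristics, so this is a simplification for the simulation rather than a demographic claim.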
Ideally, make all matches based on estimates of the population in Boulder City who are either a) between 30 and 60% of area median income (AMI) or b) below the poverty line. Option a is preferable; b is the backup if we encounter data limitations.
Proportionate match by race/ethnicity
Proportionate match by gender identity
Individuals with children under 18 should be represented at twice their estimated share of the population (the population as defined in the first bullet, on income)
Proportionate match by disability status
The eligibility questionnaire will have questions on each of the above, plus additional eligibility and other characteristics not addressed here.
Ethnicity/race options:
Non-Latino White (e.g., German, Irish, English, Italian)
Hispanic, Latinx, or Spanish origin (e.g., Mexican/Mexican American, Puerto Rican, Cuban, Dominican, Salvadoran, Colombian)
Black or African American (e.g., African American, Jamaican, Haitian, Nigerian, Ethiopian, Somalian)
Asian (e.g., Chinese, Filipino, Asian Indian, Vietnamese, Korean, Japanese)
American Indian or Alaska Native (e.g., Navajo Nation, Blackfeet Tribe, Muscogee (Creek) Nation, Mayan, Doyon, Native Village of Barrow Inupiat Traditional Government)
Native Hawaiian or Other Pacific Islander (e.g., Native Hawaiian, Samoan, Guamanian or Chamorro, Tongan, Fijian, Marshallese)
Middle Eastern or North African (e.g., Lebanese, Egyptian)
Not Listed (please specify)
Gender:
Woman
Man
Transgender
Non-binary/Gender non-conforming
Prefer to self identify (please write in your preferred identity here)
Households with children under 18
Calculated from a general question on household composition, which includes relationship and birthday questions that are in turn used to determine whether the household has children under 18.
Assume this is a binary variable (1/0), where 1 = household with children under 18.
Disability status:
This table shows the probabilities that we are working with in the current iteration of our fake data. These are a combination of empirical estimates and rough guesses (for now).
| sub_group | props | props_raw |
|---|---|---|
| race_ethnicity | ||
| White (not latino) | 0.756 | 0.756 |
| Hispanic | 0.100 | 0.100 |
| Black or African American | 0.014 | 0.014 |
| Asian | 0.051 | 0.051 |
| American Indian or Alaska Native | 0.002 | 0.002 |
| Native Hawaiian or Other Pacific Islander | 0.001 | 0.001 |
| Middle Eastern or North African | 0.038 | 0.038 |
| Not Listed | 0.038 | 0.038 |
| gender | ||
| Woman | 0.398 | 0.398 |
| Man | 0.502 | 0.502 |
| Transgender | 0.030 | 0.030 |
| Non-binary/Gender non-conforming | 0.030 | 0.030 |
| Prefer to self identify | 0.040 | 0.040 |
| child_household | ||
| No | 0.600 | 0.800 |
| Yes | 0.400 | 0.200 |
| disability | ||
| None | 0.850 | 0.850 |
| Disability1 | 0.050 | 0.050 |
| Disability2 | 0.050 | 0.050 |
| Disability3 | 0.050 | 0.050 |
This table shows the sums across sub-groups as an initial internal check. They should generally sum to 1. The values for child household have already been manipulated to ensure twice as many households with children are included.
| group | group_sum |
|---|---|
| child_household | 1 |
| disability | 1 |
| gender | 1 |
| race_ethnicity | 1 |
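The child household adjustment and the sum check can be sketched as follows. This assumes the adjustment doubles the raw "Yes" share and assigns the remainder to "No", which reproduces the 0.400 and 0.600 values in the table; the source does not spell out the exact rule, and the variable names are illustrative.

```python
# Raw (unadjusted) shares of households without and with children under 18,
# from the props_raw column.
raw = {"No": 0.800, "Yes": 0.200}

# Households with children are represented at twice their estimated share.
MULTIPLIER = 2

# Assumed rule: double the "Yes" share, then give the remainder to "No"
# so that the group still sums to 1.
adjusted_yes = raw["Yes"] * MULTIPLIER
adjusted = {"No": 1.0 - adjusted_yes, "Yes": adjusted_yes}

# Internal check: the proportions within a group should sum to 1.
group_sum = round(sum(adjusted.values()), 3)

print(adjusted)
print("group_sum:", group_sum)
```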
Fake data for an arbitrary notion of the ‘total population’. This means all the people in Boulder living between 30 and 60% AMI. Right now this is 25000 people.
A few example rows from the simulated population sample:
| id | race_ethnicity | gender | child_household | disability |
|---|---|---|---|---|
| 18190 | White (not latino) | Woman | No | None |
| 18374 | White (not latino) | Woman | Yes | None |
| 1018 | White (not latino) | Woman | Yes | Disability2 |
| 3145 | White (not latino) | Woman | No | None |
| 23489 | White (not latino) | Man | Yes | None |
| 8901 | Asian | Non-binary/Gender non-conforming | Yes | Disability3 |
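A population like this could be simulated by drawing each characteristic independently from its `props` values. The following is a minimal Python sketch of that idea, not the code used to produce the tables here; the function name `simulate_population` and the seed are assumptions.

```python
import random
from collections import Counter

# Proportions copied from the `props` column of the probability table.
PROPS = {
    "race_ethnicity": {
        "White (not latino)": 0.756,
        "Hispanic": 0.100,
        "Black or African American": 0.014,
        "Asian": 0.051,
        "American Indian or Alaska Native": 0.002,
        "Native Hawaiian or Other Pacific Islander": 0.001,
        "Middle Eastern or North African": 0.038,
        "Not Listed": 0.038,
    },
    "gender": {
        "Woman": 0.398,
        "Man": 0.502,
        "Transgender": 0.030,
        "Non-binary/Gender non-conforming": 0.030,
        "Prefer to self identify": 0.040,
    },
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {
        "None": 0.850,
        "Disability1": 0.050,
        "Disability2": 0.050,
        "Disability3": 0.050,
    },
}


def simulate_population(n, props, seed=0):
    """Draw each characteristic independently from its marginal proportions."""
    rng = random.Random(seed)
    columns = {
        group: rng.choices(list(shares), weights=list(shares.values()), k=n)
        for group, shares in props.items()
    }
    return [
        {"id": i + 1, **{group: columns[group][i] for group in props}}
        for i in range(n)
    ]


population = simulate_population(25_000, PROPS)
print(population[0])
print(Counter(person["gender"] for person in population).most_common())
```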
Randomly select 4000 from the population.
| sub_group | count | proportions | target_proportions | props_raw |
|---|---|---|---|---|
| child_household | ||||
| No | 2433 | 0.608 | 0.600 | 0.800 |
| Yes | 1567 | 0.392 | 0.400 | 0.200 |
| disability | ||||
| Disability1 | 213 | 0.053 | 0.050 | 0.050 |
| Disability2 | 213 | 0.053 | 0.050 | 0.050 |
| Disability3 | 196 | 0.049 | 0.050 | 0.050 |
| None | 3378 | 0.845 | 0.850 | 0.850 |
| gender | ||||
| Man | 1995 | 0.499 | 0.502 | 0.502 |
| Non-binary/Gender non-conforming | 133 | 0.033 | 0.030 | 0.030 |
| Prefer to self identify | 180 | 0.045 | 0.040 | 0.040 |
| Transgender | 123 | 0.031 | 0.030 | 0.030 |
| Woman | 1569 | 0.392 | 0.398 | 0.398 |
| race_ethnicity | ||||
| American Indian or Alaska Native | 9 | 0.002 | 0.002 | 0.002 |
| Asian | 198 | 0.050 | 0.051 | 0.051 |
| Black or African American | 57 | 0.014 | 0.014 | 0.014 |
| Hispanic | 411 | 0.103 | 0.100 | 0.100 |
| Middle Eastern or North African | 137 | 0.034 | 0.038 | 0.038 |
| Native Hawaiian or Other Pacific Islander | 5 | 0.001 | 0.001 | 0.001 |
| Not Listed | 158 | 0.040 | 0.038 | 0.038 |
| White (not latino) | 3025 | 0.756 | 0.756 | 0.756 |
Note: as a reminder/clarifier, in the above table the ‘proportions’ column is what we observe when we select 4000 rows/individuals from our simulated population data. The target_proportions are the values used to simulate the population data. These values will generally be very similar because when you sample a large-ish population at random you will mostly tend to maintain the proportions of its characteristic parts. No weighting is applied at this step because we assume that those who apply to the program are something like a random sample of all those who could apply (the ‘population’).
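The point about random sampling preserving proportions can be checked with a short sketch. It simulates only the gender characteristic to stay compact, then draws an unweighted sample of 4000 without replacement; the seed and variable names are arbitrary choices for illustration.

```python
import random
from collections import Counter

# Target proportions for gender, from the probability table.
GENDER_PROPS = {
    "Woman": 0.398,
    "Man": 0.502,
    "Transgender": 0.030,
    "Non-binary/Gender non-conforming": 0.030,
    "Prefer to self identify": 0.040,
}

rng = random.Random(42)

# Simulated population of 25,000 (gender only, to keep the sketch short).
population = rng.choices(
    list(GENDER_PROPS), weights=list(GENDER_PROPS.values()), k=25_000
)

# Unweighted random sample of applicants, drawn without replacement.
applicants = rng.sample(population, k=4_000)

# Compare observed proportions among applicants to the target proportions.
counts = Counter(applicants)
for sub_group, target in GENDER_PROPS.items():
    observed = counts[sub_group] / len(applicants)
    print(f"{sub_group:34s} count={counts[sub_group]:5d} "
          f"observed={observed:.3f} target={target:.3f}")
```

The observed and target columns should differ only by sampling noise, which is the pattern seen in the table above.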
The second wave selection works by taking the wave 1 selection and then using an algorithm to find each individual's closest match among the 3800 individuals remaining in the applicant pool. This is done using a technique called propensity score matching.
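To make the idea concrete, here is a scaled-down Python sketch of propensity score matching. The source does not state which model or matching algorithm was used, so everything here is an assumption: a logistic regression fitted by gradient descent, greedy 1:1 nearest-neighbor matching without replacement, a pool of 1000 applicants with 50 selections, only two characteristics, and hypothetical wave 1 weights that favor smaller gender groups.

```python
import math
import random

rng = random.Random(7)

GENDER = {"Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
          "Non-binary/Gender non-conforming": 0.030,
          "Prefer to self identify": 0.040}
CHILD = {"No": 0.600, "Yes": 0.400}

# Scaled-down applicant pool (1,000 rather than 4,000) to keep the sketch fast.
n = 1_000
applicants = [
    {"id": i,
     "gender": rng.choices(list(GENDER), weights=list(GENDER.values()))[0],
     "child_household": rng.choices(list(CHILD), weights=list(CHILD.values()))[0]}
    for i in range(n)
]

# Hypothetical wave 1: weighted sample without replacement, with weight
# 1 / group share. Uses random keys u ** (1 / weight); the top 50 are selected.
keys = {a["id"]: rng.random() ** GENDER[a["gender"]] for a in applicants}
wave1_ids = set(sorted(keys, key=keys.get, reverse=True)[:50])


def features(a):
    """Intercept plus one-hot codes (first gender level is the baseline)."""
    return ([1.0]
            + [1.0 if a["gender"] == g else 0.0 for g in list(GENDER)[1:]]
            + [1.0 if a["child_household"] == "Yes" else 0.0])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


X = [features(a) for a in applicants]
y = [1.0 if a["id"] in wave1_ids else 0.0 for a in applicants]

# Fit a logistic regression (selected in wave 1 or not) by gradient descent.
beta = [0.0] * len(X[0])
for _ in range(200):
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v
    beta = [b - 0.5 * g / n for b, g in zip(beta, grad)]

# Propensity score: predicted probability of being in wave 1.
score = {a["id"]: sigmoid(sum(b * v for b, v in zip(beta, x)))
         for a, x in zip(applicants, X)}

# Greedy 1:1 nearest-neighbor matching without replacement: each wave 1
# member is paired with the closest-scoring applicant not yet used.
available = {a["id"] for a in applicants} - wave1_ids
wave2 = {}
for sid in sorted(wave1_ids):
    match = min(available, key=lambda cid: (abs(score[cid] - score[sid]), cid))
    wave2[sid] = match
    available.remove(match)

print(len(wave2), "wave 2 matches; first pairs:", list(wave2.items())[:3])
```

Because every characteristic here is categorical, many applicants share the same score, so a match usually has the same profile as the wave 1 member. A third wave could repeat the matching step against whatever remains in `available`.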
First, let's compare the population data to the applicant data:
| group | sub_group | target_props | props_raw | ideal_counts | count_applicant | props_applicant | count_w1 | props_w1 | count_w2 | props_w2 | count_w3 | props_w3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| race_ethnicity | Native Hawaiian or Other Pacific Islander | 0.001 | 0.001 | 1 | 5 | 0.001 | 3 | 0.015 | 2 | 0.010 | NA | NA |
| race_ethnicity | American Indian or Alaska Native | 0.002 | 0.002 | 1 | 9 | 0.002 | 2 | 0.010 | 1 | 0.005 | 1 | 0.005 |
| race_ethnicity | Black or African American | 0.014 | 0.014 | 3 | 57 | 0.014 | 13 | 0.065 | 12 | 0.060 | 11 | 0.055 |
| gender | Transgender | 0.030 | 0.030 | 6 | 123 | 0.031 | 28 | 0.140 | 30 | 0.150 | 29 | 0.145 |
| gender | Non-binary/Gender non-conforming | 0.030 | 0.030 | 6 | 133 | 0.033 | 18 | 0.090 | 17 | 0.085 | 24 | 0.120 |
| race_ethnicity | Middle Eastern or North African | 0.038 | 0.038 | 8 | 137 | 0.034 | 24 | 0.120 | 27 | 0.135 | 27 | 0.135 |
| gender | Prefer to self identify | 0.040 | 0.040 | 8 | 180 | 0.045 | 6 | 0.030 | 5 | 0.025 | 4 | 0.020 |
| race_ethnicity | Not Listed | 0.038 | 0.038 | 8 | 158 | 0.040 | 8 | 0.040 | 9 | 0.045 | 8 | 0.040 |
| disability | Disability2 | 0.050 | 0.050 | 10 | 213 | 0.053 | 15 | 0.075 | 14 | 0.070 | 7 | 0.035 |
| race_ethnicity | Asian | 0.051 | 0.051 | 10 | 198 | 0.050 | 10 | 0.050 | 9 | 0.045 | 11 | 0.055 |
| disability | Disability3 | 0.050 | 0.050 | 10 | 196 | 0.049 | 11 | 0.055 | 10 | 0.050 | 7 | 0.035 |
| disability | Disability1 | 0.050 | 0.050 | 10 | 213 | 0.053 | 8 | 0.040 | 8 | 0.040 | 9 | 0.045 |
| race_ethnicity | Hispanic | 0.100 | 0.100 | 20 | 411 | 0.103 | 14 | 0.070 | 14 | 0.070 | 15 | 0.075 |
| child_household | Yes | 0.400 | 0.200 | 80 | 1567 | 0.392 | 81 | 0.405 | 80 | 0.400 | 90 | 0.450 |
| gender | Woman | 0.398 | 0.398 | 80 | 1569 | 0.392 | 62 | 0.310 | 64 | 0.320 | 60 | 0.300 |
| gender | Man | 0.502 | 0.502 | 100 | 1995 | 0.499 | 86 | 0.430 | 84 | 0.420 | 83 | 0.415 |
| child_household | No | 0.600 | 0.800 | 120 | 2433 | 0.608 | 119 | 0.595 | 120 | 0.600 | 110 | 0.550 |
| race_ethnicity | White (not latino) | 0.756 | 0.756 | 151 | 3025 | 0.756 | 126 | 0.630 | 126 | 0.630 | 127 | 0.635 |
| disability | None | 0.850 | 0.850 | 170 | 3378 | 0.845 | 166 | 0.830 | 168 | 0.840 | 177 | 0.885 |
| group | group_props |
|---|---|
| child_household | 1 |
| disability | 1 |
| gender | 1 |
| race_ethnicity | 1 |
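The `ideal_counts` column in the comparison table appears consistent with multiplying each target proportion by the 200 selections and rounding to a whole person, with a minimum of one. That rule is inferred from the values shown rather than stated in the source; the function name `ideal_count` is hypothetical.

```python
N_SELECTIONS = 200

# A few target proportions from the comparison table.
target_props = {
    "Native Hawaiian or Other Pacific Islander": 0.001,
    "Black or African American": 0.014,
    "Asian": 0.051,
    "Woman": 0.398,
    "White (not latino)": 0.756,
}


def ideal_count(prop, n=N_SELECTIONS):
    """Round prop * n to the nearest whole person, with a minimum of one."""
    return max(1, round(prop * n))


ideal_counts = {sub_group: ideal_count(p) for sub_group, p in target_props.items()}
print(ideal_counts)
```

The minimum of one matters for the smallest groups: 0.001 of 200 selections is 0.2 people, which would otherwise round to zero.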
We can examine just the race and gender breakdowns, above, to see that randomly sampling 4000 individuals from our population of 25000 leads to proportions in each group that are fairly similar.
Next, we can see how the proportions in each sampling wave compare to the ‘ideal’ proportions in the population data: